This document outlines the process of creating a Random Forest predictive model in R. Our model will attempt to predict whether or not a passenger on the Titanic would have survived based on the information they provided before boarding.
To construct our model, we have already installed and loaded the following packages: rpart, rattle, caret, ROCR, and randomForest. We have also cleaned our data so that there are no missing values.
When using decision trees or random forest modeling, it is important to divide your data into a training dataset and a test dataset. The training dataset is what we will use to construct our model. The test dataset allows us to see how effectively our model can predict the results of data it has not seen.
# Split the data into two sets
# 50% of the sample size
smp_size <- floor(0.5 * nrow(data))
# Set the seed to make your partition reproducible
set.seed(100)
trainindex <- sample(seq_len(nrow(data)), size = smp_size)
train <- data[trainindex, ]
test <- data[-trainindex, ]
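As a quick sanity check on the split, the sketch below repeats the same steps on a small made-up data frame and confirms that the two sets share no rows and together cover every row. The names `toy`, `train_toy`, and `test_toy` are hypothetical stand-ins, not part of the Titanic analysis.

```r
# Hypothetical stand-in for the cleaned `data` frame (11 rows, made-up values)
toy <- data.frame(id = 1:11, Survived = rep(c(0, 1), length.out = 11))

# Same split logic as above: half the rows (rounded down) go to training
smp_size <- floor(0.5 * nrow(toy))
set.seed(100)
trainindex <- sample(seq_len(nrow(toy)), size = smp_size)
train_toy <- toy[trainindex, ]
test_toy <- toy[-trainindex, ]

# The two sets should be disjoint and together contain every row
overlap <- intersect(train_toy$id, test_toy$id)
nrow(train_toy)   # 5
nrow(test_toy)    # 6
length(overlap)   # 0
```

With an odd number of rows, `floor()` sends the extra row to the test set.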
A random forest can be given all available candidate predictors. At each split, every tree considers only a random subset of them, and the fitted forest reports an importance score for each variable, so we do not need to hand-pick the influential ones beforehand.
This model will examine whether or not someone survived based on their social class, sex, age, number of siblings and spouses on board, number of parents and children on board, ticket price, and port of embarkation.
The model will use the train dataset to grow 1,000 decision trees. To classify a passenger, the random forest takes the majority vote of those trees as the predicted survival status.
# Random Forest
fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = train, importance = TRUE, ntree = 1000)
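To make the voting step concrete, here is a toy illustration of how per-tree predictions can be combined by majority vote. The `votes` matrix and the `majority_vote` helper are made up for this sketch; this is not the internal code of randomForest.

```r
# Hypothetical votes from 5 trees (columns) for 4 passengers (rows);
# "0" = did not survive, "1" = survived. Values are made up.
votes <- matrix(
  c("0", "0", "1", "0", "0",
    "1", "1", "1", "0", "1",
    "0", "1", "1", "1", "0",
    "0", "0", "0", "0", "1"),
  nrow = 4, byrow = TRUE
)

# For each passenger, pick the class that received the most votes
majority_vote <- function(row) {
  counts <- table(row)
  names(counts)[which.max(counts)]
}
forest_pred <- apply(votes, 1, majority_vote)
forest_pred   # "0" "1" "1" "0"
```

An odd number of trees is used here so that a two-class vote cannot tie.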
Below is the plot of our random forest's error rates against the number of trees. We can see how the error rate decreases and then plateaus as more decision trees are added.
plot(fit)
Next, we can review which variables were most useful for predicting whether someone survived the Titanic.
# Analyze importance of explanatory variables
importance(fit)
##                  0         1 MeanDecreaseAccuracy MeanDecreaseGini
## Pclass   12.895497 34.989554             36.86166        15.260668
## Sex      70.551059 88.144675             96.18286        50.186074
## Age      16.116902 19.090688             24.59926        32.970913
## SibSp    17.436767 -6.798807             12.32441         8.140573
## Parch    18.456035 10.265035             21.74832         9.378461
## Fare     14.981120 26.043025             30.64243        36.720307
## Embarked  7.384013 14.373400             16.54230         6.271409
varImpPlot(fit)
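The importance table is easier to read once sorted. The sketch below copies the two summary columns from the output above into a matrix (named `imp` here purely for illustration) and ranks the variables under each measure. With the fitted model available, `importance(fit)` would supply the same values directly.

```r
# Importance values copied from the importance(fit) output above
imp <- matrix(
  c(36.86166, 15.260668,
    96.18286, 50.186074,
    24.59926, 32.970913,
    12.32441,  8.140573,
    21.74832,  9.378461,
    30.64243, 36.720307,
    16.54230,  6.271409),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"),
    c("MeanDecreaseAccuracy", "MeanDecreaseGini")
  )
)

# Rank variables from most to least important under each measure
by_accuracy <- rownames(imp)[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE)]
by_gini <- rownames(imp)[order(imp[, "MeanDecreaseGini"], decreasing = TRUE)]
by_accuracy   # Sex, Pclass, Fare, Age, Parch, Embarked, SibSp
by_gini       # Sex, Fare, Age, Pclass, Parch, SibSp, Embarked
```

Sex ranks first under both measures, while Pclass is second by mean decrease in accuracy but fourth by mean decrease in Gini, so the two measures need not agree on the ordering.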
Here we will use the categorizing process of the random forest produced with the train dataset to predict who survived from the test dataset.
First, we must create a data frame of predictions for the test data, based on the random forest model.
pred_data <- data.frame(predict(fit, test, type = "class"))
Second, we can conduct a simple misclassification test. This tells us the overall accuracy of our model: the share of passengers whose predicted survival status matches their actual survival status.
misclassificationError <- mean(pred_data != test$Survived)
print(paste('Accuracy',1-misclassificationError))
## [1] "Accuracy 0.834080717488789"
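Third, we will construct a confusion matrix to see how our correct predictions split between true positives and true negatives, and how often the model produces false positives and false negatives. Note that caret treats the first factor level, 0 (did not survive), as the positive class here.
# Recent caret versions expect both arguments to be factors with the same levels
confusionmat <- confusionMatrix(pred_data[[1]], as.factor(test$Survived))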
confusionmat
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   0   1
##          0 258  51
##          1  23 114
##
## Accuracy : 0.8341
## 95% CI : (0.7962, 0.8674)
## No Information Rate : 0.63
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6312
## Mcnemar's Test P-Value : 0.001697
##
## Sensitivity : 0.9181
## Specificity : 0.6909
## Pos Pred Value : 0.8350
## Neg Pred Value : 0.8321
## Prevalence : 0.6300
## Detection Rate : 0.5785
## Detection Prevalence : 0.6928
## Balanced Accuracy : 0.8045
##
## 'Positive' Class : 0
##
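The headline statistics can be reproduced by hand from the four counts in the confusion matrix. The sketch below does this in base R; the object name `cm` is introduced only for illustration, and the counts are copied from the output above.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(258, 23, 51, 114), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Share of all passengers classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# caret treated "0" (did not survive) as the positive class here
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # non-survivors correctly identified
specificity <- cm["1", "1"] / sum(cm[, "1"])   # survivors correctly identified
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced_accuracy = balanced_accuracy), 4)
# accuracy 0.8341, sensitivity 0.9181, specificity 0.6909, balanced_accuracy 0.8045
```

Because the positive class is 0, the high sensitivity means the model is better at recognizing passengers who did not survive than at recognizing survivors.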
Fourth (and finally), a ROC curve plots the true positive rate against the false positive rate as we vary the probability threshold used to classify a passenger as a survivor. Each threshold corresponds to a different confusion matrix, so the curve summarizes the trade-off across all of them. A better model has a larger area under the curve (AUC).
predroc <- data.frame(predict(fit, test, type = "prob"))
pr <- prediction(predroc[2], test$Survived)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8718861
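One way to read the AUC is as the probability that a randomly chosen survivor receives a higher predicted survival probability than a randomly chosen non-survivor. The sketch below computes it that way with a rank-based (Mann-Whitney) formula on made-up scores. The `auc_by_rank` helper and the toy data are hypothetical, and this is an alternative route to the same quantity rather than a description of how ROCR computes it.

```r
# Made-up predicted probabilities of survival and true labels (1 = survived)
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(0, 0, 1, 1)

# Rank-based AUC: the share of survivor/non-survivor pairs in which
# the survivor has the higher score (tied scores count as half)
auc_by_rank <- function(scores, labels) {
  r <- rank(scores)                  # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc_toy <- auc_by_rank(scores, labels)
auc_toy   # 0.75
```

In the toy data, three of the four survivor/non-survivor pairs are ordered correctly, which gives 0.75.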
Random Forest Model